Extracting Instagram Data – Part 1

For the next two posts, I’m going to introduce some of the techniques I’ve been using to mine content on Instagram without using their API. This will include extracting a set of posts for a particular hashtag, extracting user information, and extracting a public user’s timeline. This first post will introduce some of the concepts behind the REST service that Instagram’s front-end JS client uses to load data, and show how we can use it to extract posts containing a hashtag.

If you’re just interested in the python code for this example, it can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.

Before I go any further, I’d like to point out this is not an exhaustive search of hashtags. It only returns a subset, limited to what Instagram terms “Most Recent”.

This approach looks at using Instagram’s ‘Explore’ search function: https://www.instagram.com/explore/tags/

For example, if we want to explore #food, we can use the query: https://www.instagram.com/explore/tags/food/

When querying this URL, we are greeted with a search page from Instagram showing us the top posts and the most recent ones. If we scroll down, there is a “Load more” button, followed by an infinite scroll displaying more posts. This approach will be mining that infinite scroll.

First, let’s use a cool trick with Instagram. On most pages that render database content, if you append the query param __a=1, you can get the JSON data for that page.

In our food example https://www.instagram.com/explore/tags/food/?__a=1 will give us the following results:

    {
        tag: {
            media: {
                count: 195222596,
                page_info: {...},
                nodes: [...]
            },
            content_advisory: null,
            top_posts: {
                nodes: [...]
            },
            name: "food"
        }
    }

Now as you can see, there’s a lot of information there, but it matches up with areas on the page. ‘media’ contains information about the most recent posts, while top_posts has our “Top Posts” list.
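Once that response is loaded into a Python dict, the two regions map straight onto keys. A minimal sketch, using a mocked response that mirrors the structure above (values truncated):

```python
# Mocked tag response matching the structure above (nodes truncated)
data = {
    "tag": {
        "media": {"count": 195222596, "page_info": {}, "nodes": []},
        "content_advisory": None,
        "top_posts": {"nodes": []},
        "name": "food",
    }
}

tag = data["tag"]
recent_nodes = tag["media"]["nodes"]   # the "Most Recent" posts
top_nodes = tag["top_posts"]["nodes"]  # the "Top Posts" list
total = tag["media"]["count"]          # total post count for the hashtag
```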

We also find that the content of each post is in the “nodes” arrays. A mocked-up example can be found below:

    {
        code: "AjfIaiwelA",
        dimensions: {
            width: 1080,
            height: 1080
        },
        comments_disabled: false,
        owner: {
            id: "34578678"
        },
        comments: {
            count: 5
        },
        caption: "caption of photo",
        likes: {
            count: 3
        },
        date: 1480069513,
        thumbnail_src: "http://thumbnail.src",
        is_video: true,
        id: "1391246372820331800",
        display_src: "http://image.src"
    }

Given a JSON object like that, we can easily extract the content using something like Python or Groovy, and convert it into a usable object, or just save the document directly into a JSON database like MongoDB.
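As a sketch of that extraction step in Python, field names are taken from the mocked node above; `parse_node` and `InstagramPost` are hypothetical helpers of my own, not anything Instagram provides:

```python
from dataclasses import dataclass

@dataclass
class InstagramPost:
    post_id: str
    code: str
    caption: str
    date: int
    likes: int
    comments: int
    is_video: bool
    display_src: str

def parse_node(node: dict) -> InstagramPost:
    """Convert one 'nodes' entry from the tag response into a plain object."""
    return InstagramPost(
        post_id=node["id"],
        code=node["code"],
        caption=node.get("caption", ""),  # captions can be absent
        date=node["date"],
        likes=node["likes"]["count"],
        comments=node["comments"]["count"],
        is_video=node["is_video"],
        display_src=node["display_src"],
    )

# Using the mocked node from above
post = parse_node({
    "code": "AjfIaiwelA",
    "comments": {"count": 5},
    "caption": "caption of photo",
    "likes": {"count": 3},
    "date": 1480069513,
    "is_video": True,
    "id": "1391246372820331800",
    "display_src": "http://image.src",
})
```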

Now obviously, not every available post will be in the first query, so next we must deal with paging. If we go back to our ‘media’ object, there is a ‘page_info’ object containing some cursor information. Here we have a useful boolean value, has_next_page, that indicates whether there is a next page, and an end_cursor we can use in our query.
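In code, checking for the next page is just a matter of reading those two fields. A minimal sketch (`next_cursor` is a hypothetical helper name):

```python
def next_cursor(media: dict):
    """Return the end_cursor if another page exists, otherwise None."""
    page_info = media.get("page_info", {})
    if page_info.get("has_next_page"):
        return page_info.get("end_cursor")
    return None

# Example with a mocked 'media' object
cursor = next_cursor({
    "page_info": {"has_next_page": True,
                  "end_cursor": "J0HWFJqMAAAAF0HWFJqHAAAAFjwA"}
})
```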

The next step in reverse engineering a system like this is to have a look at the network request made when we click the “Load more” button at the end of the page. This triggers a POST to “https://www.instagram.com/query/”, which is Instagram’s query endpoint.

If we look at the query, we see there are some parameters. Here’s an example:

q: ig_hashtag(food) {
    media.after(J0HWFJqMAAAAF0HWFJqHAAAAFjwA, 10) {
        nodes {
            comments {...},
            dimensions {...},
            likes {...},
            owner {...}
        }
    }
}
ref: tags::show

Interestingly, a good bulk of that reflects the JSON response we get back. In fact, you can even remove some of the parameters in the query to make a smaller request. Something else to point out here is that our end cursor is found in the query’s media.after(), and we can even bump the number of posts returned from 10 up to a slightly higher number.
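A sketch of building that `q` parameter in Python. The field list here is trimmed to the node fields seen in the response above, and `build_query` is a hypothetical helper of my own:

```python
def build_query(hashtag: str, end_cursor: str, page_size: int = 10) -> str:
    """Assemble the 'q' parameter for the /query/ POST, mirroring the
    request the 'Load more' button sends (trimmed field list)."""
    return (
        "ig_hashtag(%s) { media.after(%s, %d) { "
        "nodes { caption, code, comments { count }, date, "
        "dimensions { height, width }, display_src, id, is_video, "
        "likes { count }, owner { id }, thumbnail_src }, "
        "page_info } }"
    ) % (hashtag, end_cursor, page_size)

q = build_query("food", "J0HWFJqMAAAAF0HWFJqHAAAAFjwA", 25)
```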

Now we have the query sorted out, let’s take a look at the headers.

accept-encoding:gzip, deflate, br
cookie: "Long cookie string"... csrftoken=NEBpBZXbmq0dqh9mBTfixxm4L9psWdrE;
user-agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36
x-csrftoken:NEBpBZXbmq0dqh9mBTfixxm4L9psWdrE

Typically with a system like this, I’ll include all of the static headers a browser sends anyway, but here the important value is the x-csrftoken. CSRF tokens protect against cross-site request forgery attacks, so you need to obtain one. Typically, I’ll make a HEAD request to Instagram’s homepage, extract the whole cookie string, and use the csrftoken value from that cookie string as the value for my x-csrftoken header.
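A sketch of that bootstrap step with the standard library. `get_csrf_token` makes a live request, so it assumes Instagram still sets a csrftoken cookie on the homepage; the header set mirrors the browser request shown above:

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

def get_csrf_token() -> str:
    """Hit the Instagram homepage so the server sets a csrftoken cookie,
    then read that cookie back out of the jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    opener.open("https://www.instagram.com/")
    for cookie in jar:
        if cookie.name == "csrftoken":
            return cookie.value
    raise RuntimeError("no csrftoken cookie set")

def build_headers(csrf_token: str) -> dict:
    """Static headers that mimic a browser, plus the dynamic CSRF token."""
    return {
        "accept-encoding": "gzip, deflate, br",
        "user-agent": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36"
        ),
        "x-csrftoken": csrf_token,
    }

headers = build_headers("NEBpBZXbmq0dqh9mBTfixxm4L9psWdrE")
```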

So with all of that information, you can now go and make a POST request to Instagram for the next page in the results.

The rest of the process is then just repeating the same steps as above. The only notable difference is that, rather than having a parent ‘tag’ object enclosing our ‘media’ object, all further POST requests will just return the ‘media’ object, so you don’t have to include ‘tag’ when accessing the JSON response.
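The whole loop can be sketched as below. Here `fetch_page` stands in for whichever function performs the GET/POST described above (a hypothetical callback), so the control flow can be shown without a live connection:

```python
def collect_posts(fetch_page, max_pages=10):
    """Walk the paginated results. fetch_page(cursor) must return a
    'media' dict; the first call receives cursor=None."""
    posts, cursor = [], None
    for _ in range(max_pages):
        media = fetch_page(cursor)
        posts.extend(media.get("nodes", []))
        page_info = media.get("page_info", {})
        if not page_info.get("has_next_page"):
            break  # no further pages to request
        cursor = page_info.get("end_cursor")
    return posts

# Demo with a fake two-page fetcher
pages = {
    None: {"nodes": [{"id": "1"}],
           "page_info": {"has_next_page": True, "end_cursor": "abc"}},
    "abc": {"nodes": [{"id": "2"}],
            "page_info": {"has_next_page": False}},
}
posts = collect_posts(lambda cursor: pages[cursor])
```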

I’ve written a quick Python example for the whole process, which can be found at https://github.com/tomkdickinson/Instagram-Search-API-Python.

In a subsequent post, I’ll show how to extract comments, likes, and a full public timeline.

55 thoughts on “Extracting Instagram Data – Part 1”

  1. Hey Tom, how are you?

    Did you manage to use a similar post request to retrieve followers from a given user?
    I believe I’ve done pretty much the same as you (but with different parameters) with no success…


  2. Hi Diego,

    I’m very good thank you, hope you’re doing great too!

    I haven’t actually tried extracting those before, but having a look at it, it uses the same request structure as extracting comments.

    I might have a crack at modifying my script to do it, and let you know if I’m successful.

    Off the top of my head, have you made sure you’re including a CSRF token header, with the CSRF token Instagram gives you? As well as supplying a User-Agent string that mimics a browser like Chrome?


    • I think so!

      I managed to log in, since the followers list won’t be shown in anonymous browsing, and post a request with the data structure observed in developer mode.
      The response code is fine (200) but no data is shown, only the number of nodes in such list.

      I came across your post by chance, since I refuse to believe that it might just be some sort of protection from instagram to avoid 3rd party APIs.

      But well, it might just be the case..
      I am assuming a coding error and will use your script to avoid conceptual mistakes.
      I will let you know if any progress is made.


  3. Hi! I was wondering if you had any advice regarding scraping the JSON data for extracting comments on a post. I’ve got some code working which picks off the comments and the users who made them. But my issue is that not ALL of the comments are displayed in the JSON data due to the “load more comments” feature. This is written in the JSON using “has_next_page” / “end_cursor” like pagination. I know how to deal with the pagination of a profile’s main page, i.e. https://www.instagram.com/name/?__a=1&max_id=[value from end_cursor]. However, when I try to use this method to page through the list of comments, it does not work. Perhaps there is a different query (not ‘?__a=1’)? Thanks!

    • So the pagination for the comments needs to be POSTed with the same /query url that the pagination for retrieving more posts uses. However, there are some differences in the parameters you post. I am planning on writing something about comment collection, but my immediate advice would be to open up something like dev tools in chrome, and look at the AJAX query that is sent when you click “load more comments”. You’ll see all the query parameters that need to be sent, as well as how you can modify it to smoothly scroll through the comments with the end_cursor parameter.

  4. Why do you use ‘__a’? It works, but how did you find it out? Why not ‘__b’, for example? Where did you get the information from? Which initial information or inferences helped you discover that? How can we tell whether an internal service/site offers a way to get data directly through JSON? Which parameter should we use? Is ‘a’ a standard in the industry?
    Have fun!

    • The URL format isn’t really a standard in the industry (at least not one I’ve come across), but the way that a page may make an AJAX request and present that information is. Typically if you understand REST systems, or systems that don’t adhere to REST but use AJAX, you can reverse engineer someone else’s system.

      As for the URL, you find that information out by using something like Chrome dev tools, or some other network request analysis tool. That way you can see which URLs are loaded when certain events happen on a page.

      For example, when on Instagram you can press F12 in Chrome, and go to the Network tab. If you go to someone’s profile, you’ll see a request made to “screen_name/?__a=1”, and if you look at the response or preview of that request, you see a JSON document with most of the information on the page.

      Typically the rule of thumb I employ when writing scripts like this, is to pretend my script is a browser. If you copy all the requests that your browser makes when accessing pages like that, you can typically make a non optimised script that works, and then it’s just trial and error to see what each parameter does.

  5. Hi Tom!

    I really like your concept, and the example script you provided could get some very useful data for a semester project of mine. However, I am new to Python and could not figure out how to go over the next pages, so I can still only get the posts from the first page. May I ask you for some hints on this, like how to tweak the example code? Thanks a lot!

  6. Hi Tom,

    This string ‘__a=1’ was cool and helpful. Only one thing I am checking now: if I am not logged into Instagram, the link stops working. I feel it was working before even without being logged into Instagram, and now it has stopped!

    Let me know if there is something that I have missed!


  7. Hi
    Very useful, this script. I’m a beginner programmer, and I’m trying to do this in Java. But I have difficulty understanding the POST query string for more results.

    I know how to get all the information to put in the string. But I don’t know how to format the string after the “/query/”

    “https://www.instagram.com/query/” _______


  8. Hi Tom! Thank you very much for your solution! I was wondering if you can give me a tip how to extract username if I get just userid. Thanks a lot in advance!

    • Me too, would be super useful to get a user’s bio (I’m trying to extract location). Do you think it is actually possible? From what I’ve done so far, I’m thinking not? Cheers!

      • try this: user_url = "https://i.instagram.com/api/v1/users/" + user_id + "/info/", make the request and get the JSON.
        then your_json_response["user"]["username"] for the username, or your_json_response["user"]["biography"] for the bio

        hope this helps

  9. Amazing job! I’m using your Python script for my university project.
    But I have a problem: after 14–20 successful Instagram posts scraped, there is an error:
    ERROR:root:Expecting value: line 1 column 1 (char 0)
    Does anybody know this issue, and a possible solution?

    • Hi Max,

      Sorry for the slow reply, I’ve been very busy recently.

      Instagram has updated their code, so the old solution was invalid.

      However, I’ve spent some time today fixing this, so the script should now extract hashtags again.


  10. Is it possible to retrieve the username from the user’s id? When I do a hashtag search, the id of the owner of the post is retrieved, but I have no way of retrieving the usernames of the users who posted that media.

  11. Do you have some similar working code to extract a followers list? It seems like IG implemented a forced query_id param, but I’ve got no clue how that is generated.

  12. Hello,
    May I ask for the link to the Part 2 post? I’m very interested in finding out how to extract the comments and likes data…

  13. One quick question, what kind of limits exist for running GET requests with a logged in user account? I’m assuming Instagram monitor the number of queries being made by an account and flag if they exceed some specified limit…

    • I’m not sure what the limits are, but I have had some limitations in the past when running the script in multiple threads. I found more than 3 threads from one IP could occasionally cause issues, but to be honest, its mostly trial and error.

    • Not to my knowledge. Instagram’s search functionality seems to just take a single hashtag, as I think it uses it as an indexed key to bring back data, rather than actually searching over the data it stores (if that makes sense?)

  14. Did you know that you can crawl Instagram even without cookies?
    If you know, good for you; if you don’t, contact me.

  15. Hi Tom. Thank you for your great website. I have a question: when you were scraping data, was Instagram’s API open access or not?

  16. Hi Tom – came here trying to figure out how to get a second (or third, etc) page of posts – just the basic post info, but being able to then compile a list of all of a user’s public posts with info on likes & comments. Can’t figure out the ‘end_cursor’ or ‘after’ params from looking at requests in the console – do you know if there’s a way to do that with a basic url request, like ?__a=1&page=2 or similar?

  17. Hey Tom,

    I have instagram data I have cleaned and wish to create network analysis data for analyzing an Actor Network. Do you have any python code to convert a CSV file from the original data into a network analysis dataset?

    Thanks so much for your time.


  18. Instagram made an update that breaks the current code:

    1) Initial query result JSON structure has changed. Media can now be found at:

    2) Need to look in ConsumerCommon.js to find the queryId for subsequent page queries.

    3) Update regex to “(?<=queryId:\")[0-9A-z]{32,32}"

    4) Subsequent query URL has changed:
    – Parameter is "query_hash" instead of "query_id"
    – "variables" parameter added that contains additional query string parameters:

    GET /graphql/query/?query_hash=298b92c8d7cad703f7565aa892ede943&variables={"tag_name":"theTagNameYourLookingFor","first":9,"after":"J0HWnEKqwAAAF0HWkF56QAAAFnIA"}

  19. Hey,

    do you know how it is possible to make a sorted (by most followed) list of the accounts your followers are following?

  20. Thanks so much for this work. I am super new with Python, but I could understand a bit of your coding.

    Of course, not all of it.

    I am a finance person.

    I wish to change this a bit to add the URL or user id in the result.

    I am pretty sure there is a way of doing this. Can you give me a little tip?
